[Attention][UX][1/N] Add AttentionConfig and change attention env var… #30
MitchLewis930 wants to merge 1 commit into config_warning_before from
Conversation
…s to CLI arguments (vllm-project#26315)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
📝 Walkthrough

A comprehensive refactoring replaces environment-variable and context-manager based attention backend configuration with a unified AttentionConfig class integrated into VllmConfig. Backend selection, flash attention versioning, quantization toggles, and optimization flags now derive from configuration layers throughout the attention system and engine setup.
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks (✅ 2 passed | ❌ 1 failed)
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
vllm/platforms/cuda.py (1)
175-211: Default FlashMLA intent doesn’t match auto-selection.

When `attention_config.backend` is `None` on non‑Blackwell MLA, the code sets `use_flashmla = True` but leaves the backend unset. Auto-selection still prioritizes `FLASH_ATTN_MLA`, while block-size forcing assumes FlashMLA. Consider either setting the backend when FlashMLA is actually intended or adjusting the comment/logic to reflect the real default.

✅ Suggested fix (set backend only when FlashMLA is supported)
```diff
-        if (
-            use_flashmla
-            and is_flashmla_dense_supported()[0]
-            and cache_config.block_size % 64 != 0
-        ):
-            cache_config.block_size = 64
-            logger.info("Forcing kv cache block size to 64 for FlashMLA backend.")
+        if use_flashmla and is_flashmla_dense_supported()[0]:
+            if vllm_config.attention_config.backend is None:
+                vllm_config.attention_config.backend = (
+                    AttentionBackendEnum.FLASHMLA
+                )
+            if cache_config.block_size % 64 != 0:
+                cache_config.block_size = 64
+                logger.info(
+                    "Forcing kv cache block size to 64 for FlashMLA backend."
+                )
```
♻️ Duplicate comments (1)
vllm/attention/layer.py (1)
339-342: Same concern as above — attribute vs method.
🧹 Nitpick comments (4)
tests/v1/kv_connector/unit/test_nixl_connector.py (1)
1154-1176: Consider adding a brief comment explaining the layout detection strategy.

The logic correctly identifies the KV cache layout by probing with `num_blocks=1`. However, the reasoning behind `test_shape[0] == 1` determining `is_blocks_first` is non-obvious. A comment would help future maintainers understand why this works:
- FlashAttn shape: `(2, num_blocks, ...)` → when `num_blocks=1`, `shape[0]=2`
- Triton shape: `(num_blocks, 2, ...)` → when `num_blocks=1`, `shape[0]=1`

📝 Suggested comment
```diff
     # Store tensor info for validation
     test_shape = backend_cls.get_kv_cache_shape(
         num_blocks=1, block_size=16, num_kv_heads=1, head_size=1
     )
+    # Detect if num_blocks is the first dimension by probing with num_blocks=1.
+    # FlashAttn: (2, num_blocks, ...) → shape[0]=2; Triton: (num_blocks, 2, ...) → shape[0]=1
     is_blocks_first = len(test_shape) == 5 and test_shape[0] == 1
```

vllm/v1/attention/backends/mla/flashattn_mla.py (1)
124-135: Validate `flash_attn_max_num_splits_for_cuda_graph` is positive. A zero/negative value could lead to invalid scheduling or buffer sizing during CUDA graph capture. Consider validating before assignment.

♻️ Proposed fix
```diff
         if self.use_full_cuda_graph and self.fa_aot_schedule:
             self.scheduler_metadata = torch.zeros(
                 vllm_config.scheduler_config.max_num_seqs + 1,
                 dtype=torch.int32,
                 device=self.device,
             )
             # When using cuda graph, we need to set the upper bound of the
             # number of splits so that large enough intermediate buffers are
             # pre-allocated during capture.
-            self.max_num_splits = (
-                vllm_config.attention_config.flash_attn_max_num_splits_for_cuda_graph
-            )
+            max_splits = (
+                vllm_config.attention_config.flash_attn_max_num_splits_for_cuda_graph
+            )
+            if max_splits <= 0:
+                raise ValueError(
+                    "flash_attn_max_num_splits_for_cuda_graph must be > 0"
+                )
+            self.max_num_splits = max_splits
```

tests/v1/worker/test_gpu_model_runner.py (1)
938-956: Remove duplicate Mamba KV-cache verification loop.

The second loop repeats the exact same assertions and can be dropped to reduce test time/noise.
♻️ Suggested cleanup
```diff
-    for layer in [layer_2, layer_3, layer_4, layer_5]:
-        for i, kv_block in enumerate(kv_blocks_for_mamba):
-            actual_conv = vllm_ctx[layer].kv_cache[0][0][kv_block, :]
-            actual_ssm = vllm_ctx[layer].kv_cache[0][1][kv_block, :]
-            expected_conv = conv_blocks_constant[i]
-            expected_ssm = ssm_blocks_constant[i]
-
-            assert torch.equal(actual_conv, expected_conv)
-            assert torch.equal(actual_ssm, expected_ssm)
```

vllm/attention/selector.py (1)
60-70: Consider adding a type annotation for the `backend` parameter.

The `backend` parameter lacks a type annotation, while other parameters are properly annotated. This reduces code clarity and IDE support.

Suggested fix
```diff
 @cache
 def _cached_get_attn_backend(
-    backend,
+    backend: str | None,
     head_size: int,
     dtype: torch.dtype,
```

Note: Adjust the type based on the actual type of `attention_config.backend` (likely an enum or string).
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
- tests/compile/test_fusion_attn.py
- tests/v1/kv_connector/unit/test_nixl_connector.py
- tests/v1/worker/test_gpu_model_runner.py
- vllm/attention/backends/abstract.py
- vllm/attention/layer.py
- vllm/attention/selector.py
- vllm/attention/utils/fa_utils.py
- vllm/config/__init__.py
- vllm/config/attention.py
- vllm/config/model.py
- vllm/config/vllm.py
- vllm/engine/arg_utils.py
- vllm/model_executor/models/config.py
- vllm/model_executor/models/vision.py
- vllm/platforms/cuda.py
- vllm/utils/flashinfer.py
- vllm/v1/attention/backends/flash_attn.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/v1/attention/backends/mla/common.py
- vllm/v1/attention/backends/mla/flashattn_mla.py
- vllm/v1/attention/backends/rocm_attn.py
- vllm/v1/attention/backends/triton_attn.py
💤 Files with no reviewable changes (1)
- vllm/config/model.py
🧰 Additional context used
🧬 Code graph analysis (16)
vllm/config/__init__.py (1)
vllm/config/attention.py (1)
AttentionConfig(18-114)
vllm/attention/utils/fa_utils.py (2)
vllm/config/utils.py (1)
config(35-48)
vllm/config/vllm.py (1)
get_current_vllm_config(1310-1317)
vllm/v1/attention/backends/mla/common.py (1)
vllm/config/vllm.py (1)
get_current_vllm_config(1310-1317)
vllm/config/vllm.py (1)
vllm/config/attention.py (1)
AttentionConfig(18-114)
vllm/model_executor/models/vision.py (1)
vllm/config/vllm.py (2)
VllmConfig(175-1247)
get_current_vllm_config(1310-1317)
vllm/engine/arg_utils.py (4)
vllm/config/attention.py (1)
AttentionConfig(18-114)
vllm/config/utils.py (1)
get_field(51-73)
vllm/config/vllm.py (1)
VllmConfig(175-1247)
vllm/attention/backends/registry.py (1)
AttentionBackendEnum(34-123)
vllm/v1/attention/backends/triton_attn.py (1)
vllm/platforms/interface.py (1)
is_cuda(150-151)
vllm/platforms/cuda.py (2)
vllm/attention/backends/registry.py (1)
AttentionBackendEnum(34-123)
vllm/platforms/interface.py (1)
is_device_capability(280-301)
tests/v1/worker/test_gpu_model_runner.py (3)
vllm/attention/backends/registry.py (1)
AttentionBackendEnum(34-123)
vllm/config/attention.py (1)
AttentionConfig(18-114)
vllm/config/vllm.py (1)
VllmConfig(175-1247)
vllm/v1/attention/backends/flashinfer.py (2)
vllm/config/vllm.py (2)
VllmConfig(175-1247)
get_current_vllm_config(1310-1317)
vllm/utils/flashinfer.py (1)
can_use_trtllm_attention(282-287)
vllm/config/attention.py (3)
vllm/attention/backends/registry.py (1)
AttentionBackendEnum(34-123)
vllm/config/utils.py (3)
config(35-48)
get_hash_factors(276-294)
hash_factors(297-299)
vllm/logger.py (2)
init_logger(206-216)
warning_once(138-147)
tests/v1/kv_connector/unit/test_nixl_connector.py (7)
vllm/attention/backends/abstract.py (1)
get_kv_cache_shape(71-78)
vllm/v1/attention/backends/flash_attn.py (1)
get_kv_cache_shape(102-111)
vllm/v1/attention/backends/flashinfer.py (1)
get_kv_cache_shape(306-313)
vllm/v1/attention/backends/mla/common.py (1)
get_kv_cache_shape(302-309)
vllm/v1/attention/backends/rocm_attn.py (1)
get_kv_cache_shape(181-190)
vllm/v1/attention/backends/triton_attn.py (1)
get_kv_cache_shape(177-186)
vllm/distributed/kv_transfer/kv_connector/utils.py (1)
block_size(233-234)
vllm/model_executor/models/config.py (1)
vllm/attention/backends/registry.py (1)
AttentionBackendEnum(34-123)
vllm/attention/selector.py (1)
vllm/config/vllm.py (1)
get_current_vllm_config(1310-1317)
tests/compile/test_fusion_attn.py (1)
vllm/config/attention.py (1)
AttentionConfig(18-114)
vllm/utils/flashinfer.py (3)
vllm/config/utils.py (1)
config(35-48)
vllm/config/vllm.py (1)
get_current_vllm_config(1310-1317)
vllm/logger.py (1)
info_once(129-136)
🪛 Ruff (0.14.13)
vllm/engine/arg_utils.py
531-531: Do not perform function call get_field in dataclass defaults
(RUF009)
1718-1721: Avoid specifying long messages outside the exception class
(TRY003)
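The TRY003 finding refers to building long exception messages at the raise site. A small sketch of the idiom Ruff prefers, using a hypothetical exception type (not one that exists in vLLM):

```python
class UnknownAttentionBackendError(ValueError):
    # TRY003 prefers the long message to live inside the exception class
    # rather than being rebuilt at every raise site.
    def __init__(self, name: str) -> None:
        super().__init__(
            f"Unknown attention backend {name!r}; expected one of the "
            "AttentionBackendEnum member names."
        )
```

Raise sites then shrink to `raise UnknownAttentionBackendError(name)`.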
🔇 Additional comments (47)
tests/v1/kv_connector/unit/test_nixl_connector.py (2)
1233-1237: LGTM! The block length calculation correctly accounts for the different tensor layouts. For Triton (blocks-first), the full tensor size needs division by both `num_blocks` and 2 (K/V split). For FlashAttn, `expected_tensor_size` is already halved, so only division by `num_blocks` is needed. Both approaches correctly yield the per-block-per-KV size.
1211-1211: LGTM! The assertion correctly uses the dynamically computed `expected_num_entries` instead of a hardcoded value, allowing the test to validate both FlashAttn (4 entries) and Triton (2 entries) cache registration patterns.
vllm/v1/attention/backends/rocm_attn.py (1)
162-169: Clearer CLI guidance for unsupported head sizes. Looks good.vllm/config/__init__.py (2)
4-5: Import of `AttentionConfig` is appropriate.

49-52: `__all__` exposure for `AttentionConfig` looks correct.
51-56: Config-driven FA version override looks good.vllm/model_executor/models/vision.py (2)
14-15: Config import is appropriate here.
94-96: Config-based backend selection is consistent with the refactor.vllm/attention/layer.py (1)
303-307: The concern raised in this review is unfounded. The base class `AttentionImpl` defines `supports_quant_query_input` as a bool class attribute with a default value of `False` (vllm/attention/backends/abstract.py:300). All subclasses that override it do so correctly with boolean attribute assignments, never method definitions. Therefore, attribute access at lines 306 and 341 will always return a bool value, and the check is safe.

Likely an incorrect or invalid review comment.
vllm/attention/backends/abstract.py (1)
292-301: LGTM: capability flag is clear and consistent. The attribute-based capability flag aligns well with the config-driven checks and avoids per-call method overhead.
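The attribute-vs-method distinction discussed above can be shown in a few lines; the class names are simplified stand-ins for the vLLM classes, not the real definitions:

```python
class AttentionImpl:
    # Capability flag as a plain class attribute: cheap to read and easy
    # for subclasses to override without defining a method.
    supports_quant_query_input: bool = False


class FlashAttentionImpl(AttentionImpl):
    # Subclasses override with a boolean assignment, never a method.
    supports_quant_query_input = True


def maybe_quantize(impl: AttentionImpl) -> str:
    # Attribute access always yields a bool here; with a method-based
    # check, forgetting the call parentheses would be silently truthy.
    return "quantized" if impl.supports_quant_query_input else "plain"
```

The comment in `maybe_quantize` is the bug class the earlier review worried about: `impl.supports_quant_query_input` on a method would be a bound method object, which is always truthy.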
vllm/v1/attention/backends/triton_attn.py (1)
262-263: LGTM: instance flag preserves prior behavior. Setting the flag in `__init__` keeps parity with the previous method-based check.
30-31: LGTM: AttentionConfig import is in place. Clean wiring for the new config type.

196-197: LGTM: attention_config default factory is appropriate. Ensures the config is always present without requiring callers to pass it.

285-288: LGTM: hashing includes attention_config. Good to include attention settings in the computation-graph hash factors.
vllm/model_executor/models/config.py (2)
7-8: LGTM: enum-based backend selection. Improves type safety over string/env checks.

334-369: LGTM: config-driven backend checks read cleanly. Using `attention_config.backend` and allowing `None` keeps the logic consistent with the new config flow.
20-22: LGTM: test imports align with new public API. Clear and direct use of `AttentionConfig`.

338-339: LGTM: explicit attention_config improves test determinism. Backend is now controlled via config instead of external state.
vllm/v1/attention/backends/mla/common.py (3)
438-450: Config-driven FlashInfer prefill gating looks good. The new AttentionConfig-based checks keep the existing selection logic while centralizing the flags.

453-462: CUDNN prefill selection remains consistent under config control. Reads from AttentionConfig as expected and preserves the existing capability checks.

465-474: TRTLLM ragged DeepSeek prefill gating is aligned with new config flow. Looks consistent with the updated configuration source.
vllm/engine/arg_utils.py (8)
36-38: Import wiring for AttentionConfig is straightforward. No concerns with the added config import.

547-548: Top-level `attention_backend` override is cleanly surfaced. The extra field provides a simple CLI path while still deferring to AttentionConfig merging later.

584-587: Dict-to-AttentionConfig coercion is a nice usability touch. This mirrors other config coercions and keeps EngineArgs flexible.

725-734: New AttentionConfig CLI group is nicely scoped. Adds the expected `--attention-backend` surface without polluting other groups.

1158-1160: Passing `--attention-config` through VllmConfig is consistent with other config knobs. The dataclass JSON route looks aligned with existing patterns.

1714-1723: Mutual-exclusion logic for attention_backend vs attention_config.backend is solid. This prevents ambiguous overrides and keeps config resolution deterministic.

1781-1783: AttentionConfig is now threaded into VllmConfig as intended. Good to see it wired into the final config assembly.
531-531: Do not apply the suggested fix; it breaks the semantic intent of the code. While RUF009 is correctly triggered, the suggested fix of using `field(default_factory=AttentionConfig)` is semantically incorrect. The `get_field()` function (from `vllm.config.utils`) does not simply create a field with the type as a factory; it extracts the actual `default_factory` from the source dataclass field. Using `field(default_factory=AttentionConfig)` would call `AttentionConfig()` directly rather than using the configured default factory from `VllmConfig.attention_config`, changing the runtime behavior.

The correct approach is to suppress the RUF009 rule for this line (e.g., `# noqa: RUF009`) or refactor the code to avoid the function call in the default, rather than replacing `get_field()` with an incompatible alternative.

Likely an incorrect or invalid review comment.
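To make the distinction concrete, here is a simplified sketch of what a `get_field`-style helper does and why it is not equivalent to `field(default_factory=SomeType)`. The helper and both dataclasses are illustrative, not the actual vLLM code:

```python
from dataclasses import MISSING, dataclass, field, fields


def get_field(cls, name: str):
    # Re-export a field's configured default (or default_factory) so a
    # mirror dataclass stays in sync with the source of truth.
    src = next(f for f in fields(cls) if f.name == name)
    if src.default_factory is not MISSING:
        return field(default_factory=src.default_factory)
    return field(default=src.default)


@dataclass
class VllmConfigSketch:
    attention_backend: str = "auto"
    extra_flags: list = field(default_factory=list)


@dataclass
class EngineArgsSketch:
    # Mirrors VllmConfigSketch's defaults instead of hardcoding them;
    # this function call in a default is exactly what RUF009 flags.
    attention_backend: str = get_field(VllmConfigSketch, "attention_backend")
    extra_flags: list = get_field(VllmConfigSketch, "extra_flags")
```

Replacing `get_field(...)` with `field(default_factory=VllmConfigSketch)` would construct a whole config object rather than reusing the source field's configured default, which is the behavior change the comment warns about.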
vllm/utils/flashinfer.py (2)
270-279: Config-backed TRTLLM override hook looks good. Returning the AttentionConfig flag keeps the override source centralized.

305-369: Updated TRTLLM selection messaging and gating are consistent. The explicit enable/disable paths now reference the new CLI/config flag appropriately.
tests/v1/worker/test_gpu_model_runner.py (3)
8-13: Imports updated for config-driven backend selection. No issues with the new test dependencies.

811-818: Test now uses AttentionConfig to select FLASHINFER backend. This aligns with the new config-driven attention selection path.

827-857: Scoped set_current_vllm_config usage looks correct. Keeps the static forward context and runner initialization consistent with the active config.
vllm/config/attention.py (4)
16-49: AttentionConfig fields cover the needed backend and kernel knobs. Good consolidation of attention-related flags into a single config object.

51-63: compute_hash integration looks appropriate for graph-affecting knobs. This should keep caching keyed to attention configuration changes.

65-71: String-to-enum backend parsing is a useful usability improvement. Nice touch for CLI/config inputs.

73-114: Env-var migration warnings and overrides are handled cleanly. Centralizing the deprecation path here keeps callers simpler.
vllm/v1/attention/backends/flashinfer.py (4)
29-29: LGTM! Import addition is consistent with the codebase pattern for accessing vLLM configuration.

364-366: LGTM! Comment accurately reflects the config-driven approach for disabling TRTLLM attention.

503-513: LGTM! The config-driven q_data_type selection logic is correct. The condition properly checks both TRTLLM availability and the quantization disable flag before selecting the appropriate data type.

1041-1045: LGTM! The attribute initialization correctly combines the TRTLLM support check with the config-driven quantization disable flag. The use of `get_current_vllm_config()` follows the established pattern in this codebase.

Note: `get_current_vllm_config()` may log a warning and return a default config during CI/testing when the config is not explicitly set, but this is the expected behavior per the function's design.
vllm/v1/attention/backends/flash_attn.py (3)
266-266: LGTM! Storing an `attention_config` reference follows the same pattern as other config fields in this builder (e.g., `model_config`, `cache_config`).

307-309: LGTM! Config-driven access for `flash_attn_max_num_splits_for_cuda_graph` replaces the environment variable correctly.

559-559: LGTM! FlashAttention unconditionally supports quantized query input, which differs from FlashInfer's conditional support. Converting from a method to an attribute is consistent with the refactoring across attention backends.
vllm/attention/selector.py (2)
41-56: LGTM! The refactoring correctly retrieves the attention backend from `vllm_config.attention_config.backend` and passes it to the cached function. The import is placed inside the function to avoid circular imports, which is a common pattern in this codebase.

80-103: LGTM! The `backend` parameter is correctly passed to `get_attn_backend_cls` in both code paths (with and without the deprecated `use_v1` parameter).
Test
Summary by CodeRabbit
Release Notes
Refactor
Tests